1. Introduction
2. Project Overview
3. Breakdown of a Neural Network
4. Different Types of Layers
5. Advantages of Convolution
6. Results
7. Conclusion and Future Steps
8. Poster and Code


Introduction 



Because of the world’s current affinity towards data-driven industries,
machine learning has become an exceedingly popular area of study and
research in recent years, especially with regard to image processing and
recognition. The most common learning tools in image recognition are
neural networks, which consist of a series of connected layers of “neurons”.


In a standard neural network, each layer takes an input vector, every
element of which is connected to each neuron in the layer with a specific
“weight”. Moreover, each neuron in a layer has a specific “bias” associated
with it, designed so that the neuron will only produce a meaningful output if
the linear combination of weighted inputs (the neuron’s excitation) is
greater than that bias. This output is determined by passing the neuron’s net
excitation to an activation function. With the inputs propagating through
each layer of the network in this fashion, the neural network produces an
output corresponding to both an input and the parameters (weights and
biases) of the network. The network then learns through a stochastic
gradient descent algorithm, which updates all of the network’s parameters
in an attempt to minimize a cost function that measures the difference
between the network’s produced output and a desired output. This entire
training process is repeated for a specified number of epochs to improve
accuracy.
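
The forward pass described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than our network's code: the layer sizes, the random weights, and the names `layer_forward`, `W1`, `b1`, `W2`, and `b2` are assumptions made for the example. It also uses the common convention of adding the bias to the weighted sum, so a firing threshold corresponds to a negative bias.

```python
import numpy as np

def sigmoid(z):
    """Smooth activation that squashes any real excitation into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(x, weights, biases):
    """Output of one fully connected layer.

    x       : input vector, shape (n_in,)
    weights : one row of weights per neuron, shape (n_out, n_in)
    biases  : one bias per neuron, shape (n_out,)
    """
    excitation = weights @ x + biases   # weighted sum of inputs plus bias
    return sigmoid(excitation)          # activation of each neuron

rng = np.random.default_rng(0)
x = rng.random(4)                               # hypothetical 4-element input
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)   # hypothetical first layer
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)   # hypothetical second layer

hidden = layer_forward(x, W1, b1)        # output of the first layer
output = layer_forward(hidden, W2, b2)   # propagated through the next layer
```

Training would then adjust `W1`, `b1`, `W2`, and `b2` to reduce the cost computed from `output`.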


A convolutional neural network (CNN) is similar to a standard neural 
network except it adds convolutional layers at the beginning. Convolutional 
layers arrange neurons into grids, and convolve those grids with input 
images. The parameters of the convolutional layers are the weights of each 
neuron in each filter’s kernel and the bias applied to each filter. The output 
of the convolutional layers is then passed to the fully connected layers of a 
standard neural network. 


Our project focused on applying convolutional neural networks to 
handwritten digits, allowing us to build a system that was able to recognize 
the digits 0-9 with a great deal of accuracy. Because of the simplicity of this 
problem, we were better able to understand the structure of the tool used to 
solve the problem. With this deeper understanding of convolutional neural 
networks, we are better equipped to solve more complex problems. 


Project Motivation 

Because machine learning has become such a significant area of study and
research, we were interested in pursuing a project that would give us an
introduction to the concept. That, together with our collective interest in
image recognition, is why we chose to explore convolutional neural
networks. We chose handwritten digits as our dataset because it was
readily available to us, and because it was a simple enough problem that
we could focus our efforts on understanding the learning process, network
structure, and functionality of each layer rather than delving into the
nuances of a more complicated problem, such as facial or animal
recognition.


Previous Work 

Convolutional neural networks are a heavily studied field, with many
papers and studies published on them. Researchers have explored many
different methods for improving their performance, such as independently
training a set of networks and having them vote on the most likely output.
We used the results of this research to formulate a structure for our
network and to postulate new methods for improving performance, such as
knowledge transfer between networks.


Project Overview 

After tinkering with hyper-parameters, various cost functions, and different 
activation functions, we built a convolutional neural network ourselves 
using Python's Theano library and trained it to recognize handwritten digits 
from 0-9. 


Overview 

We built a network of 6 total layers (seen below) to detect handwritten
digits between 0 and 9. We trained and tested our network using the MNIST
data set of handwritten digits (50,000 training images, 10,000 test
images). During training, the weights and biases of each neuron in each
layer were adjusted to minimize the cost function we used in the
stochastic gradient descent algorithm. The first layer in our network is a
convolutional layer that takes a 28x28 image and convolves it with 20
different 5x5 filters, creating 20 24x24 feature maps. The second layer, a
pooling layer, samples the 24x24 feature maps down to 12x12 feature
maps. The third layer is another convolutional layer that convolves the
pooled maps with 40 different 5x5 filters, tuned to detect higher-level
features, producing 40 8x8 feature maps (12 - 5 + 1 = 8). The fourth layer
pools the output of the second convolutional layer into 4x4 maps. The fifth
layer is a fully connected layer of 100 neurons that detects the presence of
features in the 4x4 feature maps. Finally, the 6th layer is a softmax layer
that maps combinations of features from the 5th layer into the output, a
length-10 vector where the index of the largest entry is the digit our
network guessed.
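
The map sizes quoted above can be checked with a short calculation. The sketch below assumes stride-1 convolutions without padding and non-overlapping 2x2 pooling, and it uses 5x5 kernels in both convolutional layers, which is the kernel size that reproduces the 8x8 maps after the second convolution. The helper names are made up for illustration.

```python
def conv_side(side, kernel):
    """Side length of a feature map after a stride-1, unpadded convolution."""
    return side - kernel + 1

def pool_side(side, window=2):
    """Side length after non-overlapping pooling with the given window."""
    return side // window

side = 28
trace = [("input", 1, side)]

side = conv_side(side, 5)
trace.append(("conv 1", 20, side))
side = pool_side(side)
trace.append(("pool 1", 20, side))
side = conv_side(side, 5)
trace.append(("conv 2", 40, side))
side = pool_side(side)
trace.append(("pool 2", 40, side))

# Number of values handed to the fully connected layer of 100 neurons
flattened = 40 * side * side

for name, maps, s in trace:
    print(f"{name}: {maps} map(s) of {s}x{s}")
```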


Our Convolutional Neural Network

[Figure: network diagram. Input image (1 image, 28x28), Convolutional
Layer 1 (20 maps, 24x24), Pooling Layer 1 (20 maps, 12x12), Convolutional
Layer 2, Pooling Layer 2, Fully Connected Layer (100 neurons), Softmax
Layer.]

The CNN we structured to recognize handwritten digits


We also designed an applet that allows a user to visualize the flow of data in
our network. It lets the user see the effect of our network on an input image
(either an image in the MNIST dataset or an image drawn on the screen)
and see the output of each layer of our network in real time. We also
designed the GUI to display the trained filters of each convolutional layer
and the outputs of each neuron in every layer. This allowed us to gain a
visual understanding of how the filters in our network were trained, which
characteristics the filters were trained to detect, and how the filters used
convolution to recognize the different features of the digits. Additionally, it
allowed us to test and demonstrate how accurate our network was, as well
as easily identify and fix any cases where our network did not successfully
classify digits.

Our Image Recognition App

[Figure: screenshot of the applet]

The applet we created to visualize the flow of data through our CNN


Breakdown of a Neural Network 

This module breaks down a standard neural network, describing the 
different parameters, hyper-parameters, and functions that are necessary for 
building a neural network. 


Activation Functions

The different activation functions we explored were the sigmoid function,
rectified linear units (ReLU), and the softmax normalization function.

Sigmoid Function

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$


Equation for the sigmoid 
activation function 


The sigmoid activation function is one of the most widely used nonlinear
activation functions in neural networks. Intuition would naively suggest that
the activation of a neuron would be well modeled by the step function, but
the step function is not differentiable, and the stochastic gradient descent
algorithm requires that activation functions be differentiable. The solution
is to approximate the step function using a smooth function like the
sigmoid or the hyperbolic tangent. The issue with the sigmoid function is
that its derivative far from the origin is near zero, so if any individual
weight on a neuron is very wrong, the neuron is unable to use the gradient
to adjust its value. As a result, outlier weights can significantly impact the
performance of the network.
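
To make the saturation issue concrete, the sketch below evaluates the sigmoid's derivative at a few net excitations. The sample values are arbitrary and chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid, written in terms of the sigmoid itself."""
    s = sigmoid(z)
    return s * (1.0 - s)

# Arbitrary excitations moving away from the origin
excitations = np.array([0.0, 2.0, 5.0, 10.0])
slopes = sigmoid_prime(excitations)

for z, slope in zip(excitations, slopes):
    print(f"z = {z:5.1f}  derivative = {slope:.6f}")
```

The derivative peaks at 0.25 at the origin and shrinks rapidly from there, so a neuron with a large excitation receives almost no gradient signal.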

Rectified Linear Units 


$$\mathrm{ReLU}(x) = \max(0, x)$$


Equation for the ReLU activation 
function 


The advantage of using rectified linear units is threefold. First, its derivative 
is a constant (either 0 or 1) making the computation of the gradient much 
faster. Second, it is a better approximation of how biological neurons fire, in 
the sense that there is no activation in the absence of stimulation. Third, 
rectified linear units speed up learning by not being able to fire with zero 
net excitation. This means that if an excitation fails to overcome a neuron’s 
bias, the neuron will not fire at all. And when it does fire, the activation is 
linearly proportional to the excitation. The sigmoid function in comparison 
allows for some activation to occur with zero and even negative net 
excitation. However, a lower learning rate needs to be used with ReLU,
because its derivative is zero for any net excitation below zero, so a
neuron effectively stops learning once its net excitation becomes negative.
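
A minimal sketch of ReLU and its derivative follows. The derivative at exactly zero is undefined; taking it as 0 there is a convention assumed for this example.

```python
import numpy as np

def relu(z):
    """No activation for negative excitation, linear activation otherwise."""
    return np.maximum(0.0, z)

def relu_prime(z):
    """Derivative of ReLU: 0 for z <= 0 (by convention at 0), 1 for z > 0."""
    return (z > 0).astype(float)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])   # arbitrary sample excitations
activations = relu(z)
gradients = relu_prime(z)
```

Wherever `gradients` is zero, no error signal flows back through the neuron, which is why a neuron stuck at negative excitation stops updating.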

Softmax 


$$\mathrm{softmax}_j(x) = \frac{e^{x_j}}{\sum_{k} e^{x_k}}$$


Equation for the 
softmax activation 
function 


Softmax activation is particularly useful on the output layer, as it 
normalizes the output. Exponentiating each of the net excitations gives a 
more dramatic representation of the differences between them. Weak 
excitations become weaker activations and strong excitations become 
stronger activations. Everything is then normalized, giving the layer the 
effect of becoming a decision-maker. 
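
The effect can be seen in a small sketch. Subtracting the maximum excitation before exponentiating is a standard numerical-stability step that does not change the result; the input values are arbitrary.

```python
import numpy as np

def softmax(z):
    """Exponentiate each excitation, then normalize so the outputs sum to 1."""
    shifted = z - np.max(z)      # avoids overflow; the ratios are unchanged
    exps = np.exp(shifted)
    return exps / np.sum(exps)

z = np.array([1.0, 2.0, 4.0])    # arbitrary net excitations
p = softmax(z)
winner = int(np.argmax(p))       # index the layer "decides" on
```

The largest excitation is 4 times the smallest, but after softmax its activation is roughly 20 times larger (a factor of e^3), which is what exaggerates the gap between strong and weak excitations.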


Cost Functions

The different cost functions we explored for the gradient descent
learning algorithm were mean-squared error, cross-entropy, and
log-likelihood.

Mean-Squared Error 


$$\mathrm{MSE}(x, y) = \frac{1}{n} \sum_{i=1}^{n} (y_i - x_i)^2$$


Equation for the mean-squared error 
cost function 


Mean-squared error is the simplest measurement of difference that can be
used to good effect in a neural network. It can be used with any activation
function and is the most versatile option, though not always the most
effective one. One of its shortcomings is that neurons with a sigmoid
activation function become saturated quickly and are then unable to learn
further, as a result of the relatively small magnitude of the sigmoid’s
derivative far from the origin.
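
The saturation problem can be illustrated numerically. The sketch below assumes two sigmoid output neurons whose desired output is 0; one sits at zero excitation and the other is badly wrong at an excitation of 10. The function names and values are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mse(output, target):
    """Mean of the squared differences between output and target."""
    return np.mean((output - target) ** 2)

def mse_grad_wrt_excitation(z, target):
    """Gradient of the MSE with respect to each sigmoid neuron's excitation."""
    a = sigmoid(z)
    # chain rule: d(MSE)/da * da/dz, with da/dz = a * (1 - a)
    return 2.0 * (a - target) * a * (1.0 - a) / a.size

z = np.array([0.0, 10.0])        # second neuron is far more wrong
target = np.zeros(2)

cost = mse(sigmoid(z), target)
grads = mse_grad_wrt_excitation(z, target)
```

Even though the second neuron has the larger error, its gradient is thousands of times smaller than the first neuron's, so it barely learns.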

Cross-Entropy 

$$C_i(x, y) = -\left( x_i \log(y_i) + (1 - x_i) \log(1 - y_i) \right)$$


Equation for the cross-entropy cost 
function 


Cross-entropy treats the desired output as some probability distribution and 
the network’s output as another probability distribution, and measures the 
distance between the distributions. The main attraction to using cross- 
entropy is that when used in conjunction with the sigmoid activation 
function, its gradient is linearly dependent on the error, solving the issue 
with neurons becoming saturated quickly. 
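
To see why the gradient becomes linear in the error, the derivation can be sketched as follows. It assumes the network output y_i is produced by a sigmoid acting on a net excitation z_i (a symbol introduced here for illustration) and that x_i is the desired output.

```latex
\begin{align*}
y_i &= \sigma(z_i), \qquad \sigma'(z_i) = y_i (1 - y_i) \\
\frac{\partial C_i}{\partial y_i}
  &= -\left( \frac{x_i}{y_i} - \frac{1 - x_i}{1 - y_i} \right)
   = \frac{y_i - x_i}{y_i (1 - y_i)} \\
\frac{\partial C_i}{\partial z_i}
  &= \frac{\partial C_i}{\partial y_i} \, \sigma'(z_i)
   = y_i - x_i
\end{align*}
```

The sigmoid's derivative cancels, so the size of the update depends only on the error, however saturated the neuron is.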

Log-likelihood 


$$L(x, y) = -\frac{1}{n} \sum^{n} \log(x_y)$$


Equation for the log-likelihood 
cost function 


Log likelihood maximizes only the output neuron corresponding to which 
neuron should be firing. Used in conjunction with a softmax layer, all other 
output neurons would be minimized as a result of maximizing the desired 
output neuron. In this sense, a softmax layer has to be used, or the 
activations of the final layer will be too close together to draw meaningful 
conclusions. 
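
A minimal sketch of this pairing for a single input is shown below. It is written with a negative sign so that minimizing the cost maximizes the output of the correct neuron. The excitations, the label, and the function names are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    exps = np.exp(z - np.max(z))
    return exps / np.sum(exps)

def log_likelihood_cost(z, label):
    """Negative log of the softmax output for the correct class."""
    return -np.log(softmax(z)[label])

def cost_gradient(z, label):
    """Gradient of the cost with respect to the net excitations z."""
    grad = softmax(z)       # fresh array, safe to modify in place
    grad[label] -= 1.0      # only the correct neuron gets a negative gradient
    return grad

z = np.array([2.0, 1.0, 0.1])   # hypothetical output-layer excitations
label = 0                        # hypothetical correct class

cost = log_likelihood_cost(z, label)
grad = cost_gradient(z, label)
```

Gradient descent moves against the gradient, so the correct neuron's excitation is pushed up while every other neuron's is pushed down, even though the cost only mentions the correct neuron.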


Stochastic Gradient Descent 

$$w_{t+1} = w_t - \eta \nabla C(w_t)$$


Equation for the SGD learning 
algorithm, applied to both the weights 
and biases. 


Stochastic gradient descent is the algorithm used in our network to adjust
weights and biases according to the evaluation of the gradient of a given
cost function. The gradient determines whether a parameter should increase
or decrease and by how much. The learning rate of a network is a constant
that sets how far a parameter travels down its gradient at each
reevaluation. In the original algorithm, parameters are updated after
each individual input. A common practice with neural nets is to only
reevaluate the gradient after a so-called mini-batch of multiple inputs has
been passed. This way, the gradient is averaged over multiple samples and
better reflects the overall shape of the cost function, yet it is still
somewhat different every time it’s evaluated. This introduces some noise
into the updates, making it harder for parameters to get stuck in a local
minimum of the cost function.
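
The mini-batch procedure can be sketched on a toy problem. The example below fits a line to synthetic data rather than training a neural network; the data, learning rate, batch size, and epoch count are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: y = 3x + 1 plus a little noise
X = rng.uniform(-1.0, 1.0, size=200)
Y = 3.0 * X + 1.0 + rng.normal(scale=0.05, size=200)

w, b = 0.0, 0.0      # parameters to learn
eta = 0.1            # learning rate
batch_size = 10

for epoch in range(100):
    order = rng.permutation(len(X))            # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        err = (w * X[idx] + b) - Y[idx]        # prediction error on the batch
        grad_w = 2.0 * np.mean(err * X[idx])   # gradient of the batch MSE
        grad_b = 2.0 * np.mean(err)
        w -= eta * grad_w                      # step down the gradient
        b -= eta * grad_b
```

Each batch sees different samples, so successive gradients differ slightly; that variation is the noise referred to above.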


Dropout 

Overfitting is an issue experienced in networks when neurons are trained to
identify specific images in a training set rather than the more general
concept that an image represents. Instead of recognizing a 7, the network
may only recognize the particular 7s that were in the training data set. To
prevent this, we implemented dropout in our network. Random neurons in
our fully connected layers were turned off for each mini-batch, meaning
that certain weights could not be used in determining an output. This
essentially means that we were training a slightly different network each
mini-batch, encouraging more neurons to learn meaningfully, as weights
will typically be more evenly distributed. When evaluating the network, all
neurons are turned back on and their weights are scaled down to
compensate, by the fraction of neurons that was kept active during
training (one minus the dropout rate). As a result, neurons are less strongly
associated with particular images and apply to a more expansive set of
images.
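
A minimal sketch of the mechanism, assuming a dropout rate of 0.5 and a layer whose activations are all 1 for easy inspection. The function names are hypothetical. Scaling a neuron's activation at evaluation time has the same effect on the next layer as scaling the weights attached to it.

```python
import numpy as np

rng = np.random.default_rng(0)
p_drop = 0.5    # assumed dropout rate

def dropout_train(activations, p_drop, rng):
    """Turn off each neuron with probability p_drop for one mini-batch."""
    mask = rng.random(activations.shape) >= p_drop   # True means kept
    return activations * mask, mask

def dropout_eval(activations, p_drop):
    """Keep every neuron, scaled by the fraction kept during training."""
    return activations * (1.0 - p_drop)

a = np.ones(10000)                       # hypothetical layer activations
dropped, mask = dropout_train(a, p_drop, rng)
scaled = dropout_eval(a, p_drop)
```

On average the dropped-out layer and the scaled full layer deliver the same total signal to the next layer, which is why the scaling is needed.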


Different Types of Layers 
This module describes each of the different types of layers we employed in 
our convolutional neural network. 


Convolutional 

Convolutional layers produce output feature maps by convolving an input
with each of their kernels, which are trained to recognize different
characteristics. Each kernel is an arrangement of weights into a square
filter. The first convolutional layer in our network convolves the input
image with a set of 20 5x5 kernels to produce 20 feature maps, and the
second convolutional layer convolves its input (a set of pooled 12x12
feature maps) with 40 5x5 kernels to produce 8x8 higher-level feature
maps. Each neuron in our convolutional layers uses ReLU (Rectified Linear
Units) as its activation function.


The filters in the convolutional layers were trained to recognize particular 
features. The first convolutional layer detects features such as edges and 
total “mass” of the image, while the second convolutional layer detects 
higher-level features including the intersections of features detected in the 
first layer. The features that each kernel detects were trained through the 
learning process, where the weights in the kernels were updated during the 
SGD algorithm. 

2D Convolution

[Figure: 2D convolution over the input neurons]

Each filter in the convolutional layers produces a feature map using 2D
convolution as above.
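
The sliding-window operation can be sketched directly in NumPy. This is an illustrative loop, not our Theano implementation: it slides the kernel without flipping it (strict convolution flips the kernel, which for learned kernels only changes the orientation that gets learned), and the test image and hand-made edge kernel are assumptions.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image and sum the element-wise products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

# Hypothetical 28x28 image: dark left half, bright right half
image = np.zeros((28, 28))
image[:, 14:] = 1.0

# Hand-made 5x5 kernel that responds to dark-to-bright vertical edges
kernel = np.zeros((5, 5))
kernel[:, :2] = -1.0
kernel[:, 3:] = 1.0

# ReLU applied to the convolution output, as in our convolutional layers
feature_map = np.maximum(0.0, conv2d_valid(image, kernel))
```

The 28x28 input yields a 24x24 feature map that is zero over the flat regions and strongest in the columns where the window straddles the edge.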


Pooling 

Pooling layers produce an output by reducing the size of their input using
some function. The output of each convolutional layer in our network is
used as the input to a pooling layer. The pooling layers take 2x2 regions of
the input and pass the maximum value of each region as their output. In this
way, the pooling layer effectively reduces the size of the data being handled
in the network while still preserving the important features that were
detected in the convolutional layers. Significant activations of neurons are
preserved as a result of taking the maximum value in a region.

Pooling

[Figure: hidden neurons (output from feature map) feeding max-pooling
units]

The pooling layer samples each feature map into a smaller map as above.
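
A 2x2 max-pooling step can be sketched with a reshape. The function name and the small example map are assumptions for illustration.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Keep the maximum of every non-overlapping 2x2 region."""
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % 2, :w - w % 2]   # drop any odd edge
    # Split rows and columns into pairs, then take the max over each pair
    blocks = trimmed.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

fm = np.array([[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 0, 5, 6],
               [1, 2, 7, 8]], dtype=float)

pooled = max_pool_2x2(fm)   # [[4, 2], [2, 8]]
```

Applied to a 24x24 feature map, the same function returns a 12x12 map.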


Fully-Connected 

Neurons in fully connected layers are connected to every neuron in the
previous layer and every neuron in the next layer. Each connection has a
corresponding weight, and each neuron has a bias. The last 2 layers in our
network are both fully connected layers. The first fully connected layer
detects the presence of the higher-level features found in the second
convolutional layer, using the ReLU activation function. The second fully
connected layer is the softmax layer, using the softmax activation function.

Fully Connected Layers

[Figure: inputs passing through fully connected layers to an output]

Example of three fully connected layers.


Advantages of Convolution 
This module outlines the advantages of adding convolutional and pooling 
layers to a standard neural network for applications of image recognition. 


The main motivation behind using convolutional layers is that it is typically
true of images that pixels in close proximity are more closely related to
each other than to pixels that are a greater distance away. Thus, compared
to fully connected layers, convolutional layers give a better indication of
general features that appear in an image by taking advantage of this spatial
structure of images.


Shift Invariance 

A major shortcoming of fully connected networks is their dependence
on the position of a feature in an image. Such a network would recognize an
image, but not a slightly shifted copy of it. Training shift invariance into a
fully connected network is possible, but it involves extensive expansion of
the training data; it’s significantly more efficient to use convolution, which
naturally has this property. Convolutional layers detect a given feature
regardless of its position in an image. Because the MNIST data set is
centered and normalized, a fully connected network can still work, but a
network with convolutional layers is able to handle data that is not properly
centered or normalized.


Computationally Efficient 

Another consequence of using convolutional networks is that there are
fewer parameters involved, making the network more computationally
efficient to train. Any given neuron in a fully connected hidden layer has
one weight for each neuron in the previous layer plus a bias of its own,
and as such, the number of parameters scales as the number of neurons
squared, assuming a similar number of neurons in each fully connected
layer. This makes it incredibly difficult and computationally inefficient to
implement deep neural networks consisting of only fully connected layers.
Convolutional layers by contrast only have 1 bias per kernel and 1 weight
for each pixel of each kernel. A neuron in the following layer is only
connected to the number of neurons specified by the size of the kernel. Now
instead of scaling as n squared, parameters scale as the number of kernels
times the size of each kernel.
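
The scaling argument can be made concrete by counting parameters. The sketch below assumes 5x5 kernels in both convolutional layers and assumes each second-layer kernel spans all 20 pooled input maps; under those assumptions the totals match the per-layer counts listed on our poster. The helper names are made up for illustration.

```python
def conv_params(n_kernels, kernel_side, n_input_maps):
    """One weight per kernel pixel per input map, plus one bias per kernel."""
    weights = n_kernels * n_input_maps * kernel_side ** 2
    biases = n_kernels
    return weights + biases

def dense_params(n_in, n_out):
    """One weight per connection, plus one bias per neuron."""
    return n_in * n_out + n_out

conv1 = conv_params(20, 5, 1)              # 520
conv2 = conv_params(40, 5, 20)             # 20,040
fully_connected = dense_params(40 * 4 * 4, 100)   # 64,100
softmax_layer = dense_params(100, 10)      # 1,010

total = conv1 + conv2 + fully_connected + softmax_layer   # 85,670

# For comparison: a dense layer producing as many outputs as the first
# convolutional layer (20 maps of 24x24) from the 28x28 input
dense_equivalent = dense_params(28 * 28, 20 * 24 * 24)
```

The first convolutional layer needs 520 parameters, while `dense_equivalent` comes to roughly nine million for the same number of outputs.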


Results 
This module details the results our CNN obtained and observations we 
made about those results. 


Results 

The primary result obtained from the implementation of a deep 
convolutional neural network was its substantial advantage over fully 
connected networks in terms of generality, efficiency, and accuracy. Trained 
against just the MNIST data set, our convolutional network managed an 
accuracy of 99.39%, while a fully connected network with a number of 
parameters almost an order of magnitude higher managed only 98.03% 
accuracy. The real advantage lies with user input. The GUI we designed to
take user input and evaluate it through our network shows that
convolutional networks handle image transformations like shifting, scaling,
and rotations much better than fully connected networks. The fully
connected network’s weakness here is mostly due to the fact that the
numbers in the MNIST data set were centered and normalized, so it never
saw digits in other positions or sizes during training. We manually
performed image manipulations in pre-processing to expand the training
data in an attempt to train our networks to handle image transformations.
Even given this expanded data set, the fully connected network performed
more poorly than the convolutional network when given user input that
wasn’t centered and normalized. A peculiarity of the MNIST set itself is the
way some of the contributors wrote their numbers. For example, a large
number of the 6s in the set resembled a lowercase phi (φ). This isn’t
necessarily representative of the way people generally draw 6s, and it
emphasizes the importance of having a comprehensive training data set.


In actually constructing our network, we cycled through each of the
activation functions and cost functions to find the combinations that
resulted in the greatest accuracy. We found that using ReLU with a
log-likelihood cost function in conjunction with a softmax output layer gave
us the best results, so that is the structure we used in our network. Another
component we experimented with was the learning rate. Each cost function
created a different gradient, so we had to be careful in how quickly we
traversed that gradient. When we noted a network that did not learn well,
we concluded that we were likely caught in a local minimum of the cost
function, and we lowered the learning rate in our subsequent networks to
combat this.


Another interesting takeaway was the observation that many of the kernels 
produced by the convolutional layers resemble prototypical image 
processing filters like edge detectors in various orientations. This 
observation is particularly interesting because it leads to additional 
discussion about whether it would be possible to manipulate these kernels 
somehow. 


Conclusion and Future Steps 
This module outlines the conclusions we made about CNNs and the future 
steps we could take with our project. 


Conclusion and Future Steps 

Given our results, we conducted some additional exploration into
manipulations of the kernels produced by the convolutional layers. One
such manipulation was to build a library of useful-looking filters
generated by multiple separately trained networks, inject these filters into
the first convolutional layer of an untrained network, and then train that
network. We experimented by freezing the initial layer, so that the weights
and biases associated with the kernels in the first layer remained the ones
that were injected, and then proceeded to train the network. The
result was a substantial speedup in learning, reaching an accuracy of
95.43% after a single epoch, compared to the 80.21% and 60.32% obtained
by the networks that generated the kernels that were injected into the test
network. Unfreezing the layer and continuing training did not yield any
useful results: the accuracy of the network did not increase substantially
with additional training. The primary conclusion is that initializing
weights and biases in the convolutional layers to ones previously learned in
a separate network can substantially decrease the time it takes for a network
to learn. A future elaboration would be to not even generate kernels using
previously trained networks, but to initialize weights using prototypical
image processing filters. If convolutional nets can be initialized with
useful filters, they may be able to substantially reduce the amount
of time required for a network to learn, or perhaps increase accuracy.

Kernel Library

[Figure: grid of selected kernels]

Library of selected, "interesting looking" kernels.
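
The proposed future step, seeding a first layer with prototypical filters and freezing them, could look roughly like the sketch below. This is a hypothetical illustration and not something we implemented: the choice of Sobel edge detectors, their embedding into 5x5 kernels, the freezing of only the injected kernels, and all function names are assumptions.

```python
import numpy as np

# Standard 3x3 Sobel edge detector, used here as a prototypical filter
sobel_x = np.array([[-1.0, 0.0, 1.0],
                    [-2.0, 0.0, 2.0],
                    [-1.0, 0.0, 1.0]])

def embed(kernel3, side=5):
    """Center a 3x3 filter inside a zero-filled side x side kernel."""
    out = np.zeros((side, side))
    off = (side - 3) // 2
    out[off:off + 3, off:off + 3] = kernel3
    return out

rng = np.random.default_rng(0)
kernels = rng.normal(scale=0.1, size=(20, 5, 5))   # usual random start

# Inject edge detectors in four orientations into the first four slots
library = [embed(sobel_x), embed(sobel_x.T), embed(-sobel_x), embed(-sobel_x.T)]
for i, k in enumerate(library):
    kernels[i] = k

frozen = np.zeros(20, dtype=bool)
frozen[:len(library)] = True          # injected kernels are not trained

def sgd_step(kernels, grads, eta, frozen):
    """Gradient step that leaves frozen kernels untouched."""
    update = eta * grads
    update[frozen] = 0.0
    return kernels - update

# Hypothetical gradient, only to show the effect of freezing
grads = np.ones_like(kernels)
updated = sgd_step(kernels, grads, eta=0.1, frozen=frozen)
```

Unfreezing corresponds to clearing the `frozen` mask so that later steps update every kernel.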


Acknowledgements 


* Our mentor Mayank Kumar, PhD student, Rice University


* Nielsen, Michael. Neural Networks and Deep Learning.
http://neuralnetworksanddeeplearning.com/index.html


Poster and Code 
For those interested, here's our poster and a link to our code. 


Code 
Our code can be found here 


[Figures: filter detecting total figure mass; filter detecting lower edges
of the figure]

After training the network, it learned to recognize a 
variety of critical features, such as those above. Our 
final trained network was able to correctly identify 9939 
out of 10,000 test images, yielding an accuracy of 
99.39%, far superior to what we achieved without 
convolutional layers (98.03%). 


Handwritten Digit Recognition Using 
Convolutional Neural Networks 


Justin Pensock, Ethan Myers, Cody Tapscott, Austin Hudelson 
jep6@rice.edu, emm6@rice.edu, catlO@rice.edu, ash4@rice.edu 


Objective

Explore the field of image recognition by examining convolutional neural
networks (CNNs); we will analyze how adding convolutional layers to a
network improves its accuracy and how changing various parameters and
functions impacts the network's performance.


Feature Maps

* Produced using 2D convolution of a filter and input
  + Filters initialized randomly using normal Gaussian distribution
* Filter weights are tuned through learning
* Recognizes specific features in an image
  + Feature sets are created through training
  + Features include curves, intersections and total image mass


We examined the individual components that
make up a convolutional neural network:

* Hidden layers 

* Feature maps of the convolutional layers 


Then we used our results to train a network and 
recognize digits in a GUI app we developed 


Hidden Layers


Convolutional Layer (20,560 parameters)
* Convolves an input image with a filter of weights to produce feature maps
* Allows for shift-invariant identification of features

Convolutional layers, by only looking at local pixel neighborhoods, take
advantage of an image's spatial structure, allowing us to minimize a
network's parameters and feasibly train deeper networks.

Pooling Layer (0 parameters)
* Subsamples convolutional layers by taking the maximum value in a small
  neighborhood

Fully Connected Layer (64,100 parameters)
* Each neuron is connected to every neuron in the previous layer, creating
  a function based on the type and location of features

Softmax Layer (1,010 parameters)
* A fully connected layer that normalizes the outputs so that only one
  output can be high at a time


Convolutional layers in neural networks improve
overall effectiveness per parameter. Our network had
85,670 parameters; high-performing fully connected
networks that solve the same problem have nearly
636,000 parameters (1).


Acknowledgements

* Our mentor Mayank Kumar, PhD student, Rice University.
* Nielsen, Michael. Neural Networks and Deep Learning.
  http://neuralnetworksanddeeplearning.com/index.html
* (1) P. Y. Simard, D. Steinkraus, and J. C. Platt, "Best Practices for
  Convolutional Neural Networks Applied to Visual Document Analysis"

